{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Modify the report's structure\n",
    "In this notebook we have a look at two use cases in which we modify the existing report structure: splitting up large reports and reordering the report sections. Both use cases are based on actual user inquiries. The datasets used in this notebook are obtained using the `kaggle` api. If you haven't done so already, you should set up the [api credentials](https://github.com/Kaggle/kaggle-api#api-credentials)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make sure that we have the latest version of pandas-profiling."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "!{sys.executable} -m pip install -U pandas-profiling[notebook]\n",
    "!jupyter nbextension enable --py widgetsnbextension"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might want to restart the kernel now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import kaggle\n",
    "\n",
    "from pandas_profiling.utils.common import extract_zip\n",
    "\n",
    "kaggle.api.authenticate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reorder Sections\n",
    "\n",
    "We can leverage the same approach to reorder sections. First, we need to generate a profile report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We are using the Craigslist Carstrucks Data\n",
    "vehicles_dataset = Path(\"craigslist-carstrucks-data/vehicles.csv\")\n",
    "\n",
    "# Download and extract (~228M)\n",
    "if not vehicles_dataset.exists():\n",
    "    kaggle.api.dataset_download_files(\n",
    "        \"austinreese/craigslist-carstrucks-data\",\n",
    "        path=\"craigslist-carstrucks-data\",\n",
    "        quiet=False,\n",
    "    )\n",
    "\n",
    "    extract_zip(\n",
    "        \"craigslist-carstrucks-data/craigslist-carstrucks-data.zip\",\n",
    "        \"craigslist-carstrucks-data/\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas_profiling import ProfileReport\n",
    "\n",
    "# For our demonstration, we only take a fraction of the dataset\n",
    "df = pd.read_csv(vehicles_dataset, nrows=100)\n",
    "\n",
    "# Generate the profile report\n",
    "vehicles_report = ProfileReport(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The structure of the report is stored in the `report` attribute. The report is essentially a tree object. We inspect the root of the report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(repr(vehicles_report.report))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the root object is of the type \"Sequence\". Sequence types have at least the attributes `name` and `items`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(vehicles_report.report.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we would like to pull up the samples section, so that the reordered sequence items are:\n",
    "- Overview\n",
    "- Samples\n",
    "- Missing\n",
    "- Variables\n",
    "- Interactions\n",
    "- Correlations\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reorder the sections\n",
    "vehicles_report.report.content[\"items\"] = [vehicles_report.report.content[\"items\"][i] for i in [0, 5, 1, 2, 3, 4]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can render the report and see that the changes have taken place:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vehicles_report.to_notebook_iframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Split Profile Reports\n",
    "\n",
    "When profiling large datasets, a monolithic HTML file can become enormous. Using the report structure generated by `pandas-profiling`, we create a modular report. In this notebook we demonstrate how to split up a profile report in multiple different titles. We start with generating the report's structure in the usual way. The `minimal` mode is set to `True`. This step may take a few minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We are using the IEEE Fraud Detection transaction training data\n",
    "ieee_dataset = Path(\"ieee-fraud-detection/train_transaction.csv\")\n",
    "\n",
    "# Download and extract (~118M)\n",
    "if not ieee_dataset.exists():\n",
    "    kaggle.api.competition_download_files(\n",
    "        \"ieee-fraud-detection\", path=\"ieee-fraud-detection\", quiet=False\n",
    "    )\n",
    "\n",
    "    extract_zip(\n",
    "        \"ieee-fraud-detection/ieee-fraud-detection.zip\",\n",
    "        \"ieee-fraud-detection/\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from pandas_profiling import ProfileReport\n",
    "\n",
    "# Read the dataset\n",
    "df = pd.read_csv(ieee_dataset)\n",
    "\n",
    "# Generate the profile report\n",
    "ieee_fraud_report = ProfileReport(df, minimal=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(repr(ieee_fraud_report.report))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(repr(ieee_fraud_report.report.content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from copy import deepcopy\n",
    "\n",
    "# Make a copy for the original report structure\n",
    "original_report_structure = deepcopy(ieee_fraud_report.report)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Loop over each section\n",
    "for section in original_report_structure.content['items']:\n",
    "    # Only consider sections that contain items\n",
    "#     if len(section.content['items']) > 0:\n",
    "    # Set the report structure\n",
    "    ieee_fraud_report.report = deepcopy(original_report_structure)\n",
    "    # Overwrite the section lists with the section we would like to print\n",
    "    ieee_fraud_report.report.content['items'] = [section]\n",
    "    # Output the report to HTML\n",
    "    ieee_fraud_report.to_file(\"ieee_fraud_report_section_{name}.html\".format(name=section.name.lower()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Paginate variables\n",
    "We can use the same approach to paginate variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ieee_fraud_report.report = original_report_structure\n",
    "\n",
    "# Number of variables per page\n",
    "page_size = 25\n",
    "\n",
    "# The Root node, which is a sequence of sections\n",
    "print(repr(ieee_fraud_report.report.content['items']))\n",
    "\n",
    "# The variables\n",
    "variable_section = ieee_fraud_report.report.content['items'][1]\n",
    "variables = variable_section.content['items']\n",
    "variable_count = len(variables)\n",
    "print('Number of variables: {count}'.format(count=variable_count))\n",
    "\n",
    "# Reset the report structure\n",
    "ieee_fraud_report.report = deepcopy(original_report_structure)\n",
    "\n",
    "# Only keep the variables section\n",
    "ieee_fraud_report.report.content['items'] = [ieee_fraud_report.report.content['items'][1]]\n",
    "\n",
    "for page_num, variable_page in enumerate([variables[i:i + page_size] for i in range(0, variable_count, page_size)]):\n",
    "    print(\"Write page {page}\".format(page=page_num))\n",
    "    # Set the report title\n",
    "    ieee_fraud_report.title = \"IEEE Fraud Detection Dataset, Variables, Page {page}\".format(page=page_num)\n",
    "    \n",
    "    # Overwrite the variables lists with the variables we would like to print\n",
    "    ieee_fraud_report.report.content['items'][0].content['items'] = variable_page\n",
    "    \n",
    "    # Output the report to HTML\n",
    "    ieee_fraud_report.to_file(\"ieee_fraud_report_variables_page_{page}.html\".format(page=page_num))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we have seen two ways of manipulating the report structure. Advanced users may alter the structure in other ways we have not touched, such as exploring deeper parts of the tree structure or inserting and deleting objects."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
